This is the Optimization Demonstration for Random Forests

Approach

Unconstrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Find the combination of algorithm inputs that maximizes accuracy. If there are ties, break them by choosing the point that requires the least data.
  3. Find the costs associated with running the algorithm with those inputs on all different hardware configurations.
  4. Find the combination of hardware that jointly minimizes cost and time.
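The four steps above can be sketched in base R. Everything here is a hypothetical stand-in: the grid, the accuracy/time/cost formulas, and the column names are assumptions, since the real surfaces come from the fitted models later in the document.

```r
## Hypothetical grid of candidate configurations with assumed predicted
## accuracy, execution time, and cost surfaces.
grid <- expand.grid(nTrain = c(1000, 5000, 10000), nCovar = c(2, 5),
                    cores = 1:4, GB = c(2, 4))
grid$accuracy <- 0.6 + 0.02 * log(grid$nTrain) + 0.01 * grid$nCovar
grid$seconds  <- grid$nTrain * grid$nCovar / (500 * grid$cores)
grid$cents    <- grid$seconds * (grid$cores + grid$GB / 2) / 60

## Step 2: accuracy-maximizing inputs, ties broken by least data.
ties   <- grid[grid$accuracy == max(grid$accuracy), ]
inputs <- ties[which.min(ties$nTrain), c("nTrain", "nCovar")]

## Step 3: time and cost of running those inputs on every hardware option.
cand <- grid[grid$nTrain == inputs$nTrain & grid$nCovar == inputs$nCovar, ]

## Step 4: jointly minimize time and cost, normalizing each dimension
## by its standard deviation so the two are weighted equally.
d <- sqrt((cand$seconds / sd(cand$seconds))^2 + (cand$cents / sd(cand$cents))^2)
optimal <- cand[which.min(d), ]
```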

Data-constrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Subset the accuracy surface produced above to the amount of data available. The maximizing point will fall in the upper right corner of the subsetted space.
  3. Find the costs associated with running the algorithm with the accuracy-maximizing point on all different hardware configurations.
  4. Find the combination of hardware that jointly minimizes time and cost.
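The subsetting in step 2 can be sketched on the same kind of hypothetical predicted-accuracy grid; `available` is an assumed data limit, not a value from the experiments:

```r
## Hypothetical predicted-accuracy surface over the algorithm inputs.
grid <- expand.grid(nTrain = c(1000, 5000, 10000), nCovar = c(2, 5))
grid$accuracy <- 0.6 + 0.02 * log(grid$nTrain) + 0.01 * grid$nCovar

available <- 5000  # assumed number of training examples on hand
feasible  <- grid[grid$nTrain <= available, ]

## With a monotone surface, the maximizer sits in the upper right corner
## of the subsetted space: all available data, all covariates.
best <- feasible[which.max(feasible$accuracy), ]
```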

Cost-constrained Optimization

  1. Use a combination of prediction and interpolation to predict all values of time and accuracy under different hardware/software configurations.
  2. Subset the space produced above to the amount of time and money that can be spent on modelling.
  3. Working backwards now, find the accuracies that can be produced in the limited time.
  4. Using the subset of accuracy space, find the combination of algorithm inputs that maximizes accuracy.
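Steps 2-4 can be sketched the same way, filtering on assumed time and cost budgets before maximizing accuracy; the grid and budget values below are hypothetical:

```r
## Hypothetical predictions over algorithm inputs and hardware.
grid <- expand.grid(nTrain = c(1000, 5000, 10000), nCovar = c(2, 5), cores = 1:4)
grid$accuracy <- 0.6 + 0.02 * log(grid$nTrain) + 0.01 * grid$nCovar
grid$seconds  <- grid$nTrain * grid$nCovar / (500 * grid$cores)
grid$cents    <- grid$seconds * grid$cores / 60

budget.time <- 30  # seconds, assumed
budget.cost <- 2   # cents, assumed
affordable  <- subset(grid, seconds <= budget.time & cents <= budget.cost)

## Working backwards: within the affordable subset, maximize accuracy.
best <- affordable[which.max(affordable$accuracy), ]
```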
knitr::opts_chunk$set(cache=TRUE, echo=F, warning=F, error = F, message=F)
knitr::opts_knit$set(root.dir = "/users/scottsfarley/documents")
setwd("/users/scottsfarley/documents")
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
library(akima)
library(ggplot2)
options(java.parameters = "-Xmx1500m")
library(bartMachine)
## Loading required package: rJava
## Loading required package: bartMachineJARs
## Loading required package: car
## Warning: replacing previous import 'lme4::sigma' by 'stats::sigma' when
## loading 'pbkrtest'
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Loading required package: missForest
## Loading required package: itertools
## Welcome to bartMachine v1.2.3! You have 1.4GB memory available.
## 
## If you run out of memory, restart R, and use e.g.
## 'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
## 'library(bartMachine)'.
library(reshape2)
library(ggdendro)

threshold.time <- 22 ##seconds
threshold.cost <- Inf ##cents
threshold.numTex <- 45 ##training examples

First, get the training data and fit the model. Perform some skill checks on it.

## serializing in order to be saved for future R sessions...done
## serializing in order to be saved for future R sessions...done

Choose a finite number of possible solutions to the model. Ideally, we would want every single combination of predictor variables in [0, Inf]. This is obviously intractable; moreover, I only have data for a subset of that space anyway. So randomly sample the subspace in which I have data to make the problem tractable.
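A minimal sketch of that sampling step, with assumed variable ranges standing in for the extent of the collected data:

```r
set.seed(42)
n <- 500  # number of candidate configurations to evaluate
candidates <- data.frame(
  cores    = sample(1:24, n, replace = TRUE),
  GBMemory = sample(c(2, 4, 6, 22), n, replace = TRUE),
  nTrain   = sample(seq(100, 10000, by = 100), n, replace = TRUE),
  nCovar   = sample(1:5, n, replace = TRUE)
)
```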

Using that subset of data and the models we fit previously, predict execution time and SDM accuracy for each candidate configuration of algorithm inputs and hardware variables.
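Generically, that prediction step is a `predict()` call over the candidate grid. Since the bartMachine fits aren't reproduced in this snippet, a simple linear model on synthetic data stands in for the fitted time model:

```r
## Stand-in training data and model (the real fits are bartMachine models).
set.seed(7)
train <- data.frame(cores  = sample(1:24, 200, replace = TRUE),
                    nTrain = sample(1:10000, 200, replace = TRUE))
train$seconds <- train$nTrain / (400 * train$cores) + rnorm(200, sd = 0.1)
time.model <- lm(seconds ~ nTrain * cores, data = train)

## Predict execution time for every candidate configuration.
candidates <- expand.grid(cores = c(2, 8, 16), nTrain = c(1000, 10000))
candidates$seconds.hat <- predict(time.model, newdata = candidates)
```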

Plot the posterior means of the accuracy models against the algorithm inputs that should control accuracy. In this case, these are number of training examples and number of covariates.

The accuracy clearly varies from low (few training examples and few covariates) to very high (many training examples, many covariates). Perhaps more data would be helpful here, but we work with what we have. Our task is to find the combination of inputs that results in the highest-accuracy model. If there’s a tie, find the combination that needs the least data.

Now, we know the combination of algorithm inputs that results in the highest accuracy. The figure below shows the combination identified on the training examples and covariates axes. This combination of training examples and number of covariates can be run on any combination of hardware, some of which might be suboptimal. Thus, at this point, we’ve solved half of our challenge: algorithm inputs have been optimized; now it’s time to optimize hardware.

## [1] "Accuracy is maximized at 10000 training examples and 5 predictors."

In theory, the hardware parameters should not affect the SDM accuracy. We can test this assumption here by plotting the modeled accuracy for this combination of algorithm inputs against the number of CPUs and amount of memory. If the assumption is valid, the plot should show no change in either the horizontal or vertical direction. We see that there is, in fact, some change. This is likely due to the experimental design and the lack of a full factorial setup. The effect is relatively minor, so I note it and move on.

## [1] "Accuracy Range on Hardware:  0.0105984120686514"
## [1] "Accuracy Range from Expectation:  0"
## [1] "------"
## [1] "Fixing accuracy at:  0.851811113052221"

Now, fix the algorithm inputs at the accuracy-maximizing point, effectively fixing expected model accuracy. An algorithm with these inputs can be run on any combination of hardware. Project how long that specific model would take and how much it would cost on all computing types, then plot those on time vs. cost axes.

The optimal solution is the one that balances time and cost equally during the minimization. We use a standardized Euclidean distance, normalizing each dimension by its standard deviation so the two are weighted equally. For each candidate combination of hardware, we calculate the distance between it and the origin of these two axes, then take the point with the minimum distance as the optimal.
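That distance calculation can be sketched directly; the values below are hypothetical stand-ins for the projected time/cost table:

```r
## Hypothetical projected time and cost for three hardware configurations.
hw <- data.frame(cores = c(5, 7, 17), GBMemory = c(2, 2, 22),
                 seconds = c(61.1, 61.9, 72.3), cents = c(9.2, 13.0, 283.1))

## Normalize each axis by its standard deviation so time and cost weigh
## equally, take the distance to the origin, and pick the minimum.
hw$distance <- sqrt((hw$seconds / sd(hw$seconds))^2 +
                    (hw$cents   / sd(hw$cents))^2)
optimal <- hw[which.min(hw$distance), ]
```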

Our job is complete. We’ve now optimized both the hardware and software dimensions of the problem.

## [1] "------RANDOM FOREST OPTIMAL--------"
## [1] "Predicted Optimal Accuracy 0.851811113052221 +/- 0"
## [1] "Predicted Optimal Cost (seconds) 61.102049731714"
## [1] "Predicted Optimal Cost (cents) 9.18363807467662"
## [1] "Cores:  5"
## [1] "Memory: 2"
## [1] "Training Examples: 10000"
## [1] "Covariates: 5"

Everything up to this point was done using the mean of the posterior distribution, a choice that simplifies the process but discards information and may lead to over-confidence in the predictions. We can modify our steps to include information from the entire posterior, which may solve this issue.

Instead of projecting just the mean time and mean cost for use in the distance minimization, use the entire set of posterior samples. Calculate the distance metric for each sample in the posterior independently. You’re then left with a density distribution of distances, from which we can infer the minimum value.
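With posterior draws in hand (simulated here, since the real ones come from the bartMachine fits), the per-sample distance works the same way and yields a distance distribution per configuration:

```r
set.seed(3)
n.samp <- 1000
## Simulated posterior draws of execution time for two hypothetical configs;
## cost is a fixed linear function of time at each config's assumed rate.
post <- data.frame(config  = rep(c("A", "B"), each = n.samp),
                   seconds = c(rnorm(n.samp, 61, 3), rnorm(n.samp, 72, 3)))
rate <- c(A = 0.15, B = 3.9)  # cents per second, assumed
post$cents <- post$seconds * rate[post$config]

## Standardized distance to the origin, computed per posterior sample.
post$distance <- sqrt((post$seconds / sd(post$seconds))^2 +
                      (post$cents   / sd(post$cents))^2)
dist.summary <- aggregate(distance ~ config, data = post,
                          FUN = function(d) c(mean = mean(d), sd = sd(d)))
```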

The posteriors are in a line, since there’s a fixed linear relationship between time and cost.

Now, find the distance metrics for all of those points.

There’s a lot of overlap in this figure, and many points are far from the optimal. We don’t care about those. Take the few closest to the minimum and look at their distributions.
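The cluster labels attached to the near-optimal candidates can be produced by hierarchical clustering on the distance summaries; a sketch with hypothetical values:

```r
## Hypothetical (mean, sd) distance summaries for five near-optimal candidates.
cand <- data.frame(distance.mean = c(65.6, 66.2, 67.0, 111.9, 113.8),
                   distance.sd   = c(17.7, 21.9, 22.1, 40.1, 42.2))

## Scale both columns, cluster hierarchically, and cut into two groups.
hc <- hclust(dist(scale(cand)))
cand$cluster <- cutree(hc, k = 2)
```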

Now, the optimal configuration may be one of the following:

config cores GBMemory seconds cost distance.mean distance.sd
73 7 2 61.94721 13.03493 65.57112 17.67331
97 9 2 60.75088 16.43554 66.15120 21.89970
109 10 2 61.10775 18.36899 67.03575 22.05793
121 11 2 61.19141 20.23355 67.62757 21.96745
133 12 2 61.42564 22.15746 68.51374 22.23448
145 13 2 61.58649 24.06677 69.41221 22.66935
157 14 2 61.14458 25.73209 69.74176 23.05284
169 15 2 60.71479 27.37630 70.19451 23.98477
204 17 22 72.26293 283.11317 71.44524 28.26655
193 17 2 58.90948 30.10392 71.51950 29.00230
216 18 22 71.43575 296.33551 71.96854 29.19276
181 16 2 62.99533 30.29823 73.84766 25.34725
228 19 22 71.44955 312.85900 76.15298 33.14503
240 20 22 70.75760 326.13591 78.46303 34.54157
252 21 22 71.69278 346.96866 78.65747 33.68780
264 22 22 71.29078 361.45279 79.44667 34.01575
276 23 22 71.05274 376.62072 81.88576 36.55447
194 17 4 80.08126 68.20521 111.92672 40.11537
205 18 2 58.29714 31.54341 113.76909 42.19527
217 19 2 57.82351 33.02532 116.28979 43.49834
195 17 5 80.08126 81.84625 121.84012 43.66841
206 18 4 79.21041 71.43195 124.48964 46.17136
58 5 18 84.28311 80.22909 126.85322 54.67832
229 20 2 59.66210 35.86886 127.41700 66.68422
241 21 2 60.48822 38.18380 134.02886 75.79041

config cores GBMemory seconds cost distance.mean distance.sd cluster
73 7 2 61.94721 13.03493 65.57112 17.67331 1
97 9 2 60.75088 16.43554 66.15120 21.89970 1
109 10 2 61.10775 18.36899 67.03575 22.05793 1
121 11 2 61.19141 20.23355 67.62757 21.96745 1
133 12 2 61.42564 22.15746 68.51374 22.23448 1
145 13 2 61.58649 24.06677 69.41221 22.66935 1
157 14 2 61.14458 25.73209 69.74176 23.05284 1
169 15 2 60.71479 27.37630 70.19451 23.98477 1
193 17 2 58.90948 30.10392 71.51950 29.00230 1
181 16 2 62.99533 30.29823 73.84766 25.34725 1
In the results above, you’re actually seeing the tradeoff between time and money play out quite nicely. Adding cores costs money but, in the case of random forests, reduces time. Here, that tradeoff basically evens out exactly.

Data Constraint

In this case, we’ve got a constraint on the amount of data available to us.

## [1] "Current data threshold is  45"
## [1] "Accuracy is maximized at 1 training examples and 5 predictors."
## [1] "Expected Max Accuracy is  0.786606226825189"
## [1] "Now there are only:  287 candidates, instead of  287 candidates that can complete this scenario under budget."

## [1] "Recommended # cores:  6"
## [1] "Recommended Memory:  2"
## [1] "Expected Cost:  1.04649454249435"
## [1] "Expected Seconds:  5.80225406129049"

config cores GBMemory seconds cost distance.mean distance.sd
61 6 2 5.802254 1.0464945 6.113652 1.676867
49 5 2 5.851315 0.8794526 6.121451 1.626490
97 9 2 6.146885 1.6629782 6.756843 2.342430
109 10 2 6.160349 1.8518010 6.823160 2.358761
121 11 2 6.162140 2.0375733 6.892751 2.423780
133 12 2 6.184196 2.2307630 7.010260 2.577369
145 13 2 6.188951 2.4185185 7.124181 2.749322
157 14 2 6.186698 2.6036098 7.201597 2.789247
169 15 2 6.417694 2.8937381 7.662130 3.288455
216 18 22 14.195518 58.8869842 10.375337 5.036876
204 17 22 14.661601 57.4415117 10.397984 4.902186
193 17 2 8.422389 4.3040091 10.564643 5.023653
228 19 22 14.052031 61.5301907 10.975582 5.212812
264 22 22 14.209866 72.0457249 11.422386 5.441166
276 23 22 14.153104 75.0196617 11.708109 5.629797

config cores GBMemory seconds cost distance.mean distance.sd cluster
61 6 2 5.802254 1.0464945 6.113652 1.676867 1
49 5 2 5.851315 0.8794526 6.121451 1.626490 1
97 9 2 6.146885 1.6629782 6.756843 2.342430 1
109 10 2 6.160349 1.8518010 6.823160 2.358761 1
121 11 2 6.162140 2.0375733 6.892751 2.423780 1
133 12 2 6.184196 2.2307630 7.010260 2.577369 1
145 13 2 6.188951 2.4185185 7.124181 2.749322 1
157 14 2 6.186698 2.6036098 7.201597 2.789247 1
169 15 2 6.417694 2.8937381 7.662130 3.288455 1

Cost Constraint II

config cores GBMemory seconds cost distance.mean distance.sd
136 12 6 21.42370 18.03190 3.473044 0.6116520
132 11 22 16.14297 40.92340 3.528074 0.5187391
171 15 5 21.99952 19.83917 3.618800 0.6486117
195 17 5 21.99623 22.48103 3.726266 0.6641395
207 18 5 21.99742 23.80473 3.774779 0.6706958
219 19 5 21.99826 25.12817 3.838248 0.6937793
148 13 6 21.37714 19.49211 4.308378 0.6491336
160 14 6 21.32160 20.93696 4.403905 0.6763910
172 15 6 21.99952 23.14570 4.586430 0.7160127

config cores GBMemory seconds cost distance.mean distance.sd cluster
136 12 6 21.42370 18.03190 3.473044 0.6116520 1
171 15 5 21.99952 19.83917 3.618800 0.6486117 1